library(tidyverse)
library(patchwork) 

library(lubridate)

library(tsibble)
library(feasts)
library(forecast)

library(sandwich)
library(lmtest)

library(nycflights13)
library(blsR)
theme_set(theme_minimal())

Plot Flights and Weather Data

To start with this homework, you will be using the same data that Jeffrey uses in the lecture – US flights data. The data comes from the package nycflights13.

Question Goals

Our goal with the tasks in this question is to familiarize you with some of the key programming concepts related to time series data – setting time indexes and key variables, grouping and indexing on those variables, and producing descriptive plots of data stored in time series form.

Question 1 - Flights to nice places

In the package declarations, we have loaded the nycflights13 package. This provides three objects that we are going to use:

  1. flights;
  2. airports; and,
  3. weather.

You can investigate these objects more by issuing a ? before them, to access their documentation.

(1 point) Create Data

As provided, both flights and weather are “plain” data frames. To begin, cast the flights dataset into a time series dataset, a tsibble.

  • Use the combination of year, month, day, hour, and minute to produce the time index. Call this newly mutated variable time_index. There is very good handling of dates inside the lubridate package, and RStudio makes a nice one-page cheatsheet available. For this task you might be looking for lubridate::make_datetime.
  • Although it may not be true in general, for this work assume that you can uniquely identify a flight by the carrier and the flight number, so you can use these two pieces of information to define the key. We need to define a key because in some cases more than one flight leaves at the same time – the granularity of our time measure is the minute, and it is possible for two planes to leave within the same minute.

Answer

Data is created below. I first checked for missing values.

There is more than one way to create the time index; I experimented with a couple and settled on the method below.

#flights
#?flights

# Check for missing values in the departure-time components
if (nrow(flights) == sum(!is.na(flights$time_hour) & !is.na(flights$minute))) {
  print("No missing values for departure time y-m-d-h-m")
}
## [1] "No missing values for departure time y-m-d-h-m"
flights$time_index <- 
  flights$time_hour + lubridate::minutes(flights$minute)
flights_ts <- 
  tsibble::as_tsibble(
    flights,
    key = c(carrier, flight),
    index = time_index
  )
flights_ts
## # A tsibble: 336,776 x 20 [1m] <America/New_York>
## # Key:       carrier, flight [5,725]
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    11     3     1531           1540        -9     1653           1725
##  2  2013    11     4     1539           1540        -1     1712           1725
##  3  2013    11     5     1548           1540         8     1708           1725
##  4  2013    11     6     1535           1540        -5     1657           1725
##  5  2013    11     7     1549           1540         9     1733           1725
##  6  2013    11     8     1539           1540        -1     1706           1725
##  7  2013    11     9     1535           1540        -5     1714           1723
##  8  2013    11    10     1538           1540        -2     1718           1725
##  9  2013    11    11     1527           1540       -13     1710           1725
## 10  2013    11    12     1535           1540        -5     1709           1725
## # ℹ 336,766 more rows
## # ℹ 12 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>, time_index <dttm>
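As an aside, the same index can be built directly from the component columns with lubridate::make_datetime, as the prompt suggests. A minimal sketch on a toy row (the data frame and its values here are made up to mirror the flights columns):

```r
library(lubridate)

# Toy example: build the departure time from its year/month/day/hour/minute
# components, mirroring the columns available in flights.
df <- data.frame(year = 2013, month = 11, day = 3, hour = 15, minute = 31)
df$time_index <- make_datetime(df$year, df$month, df$day, df$hour, df$minute)
format(df$time_index, "%Y-%m-%d %H:%M")  # "2013-11-03 15:31"
```

One caveat: make_datetime defaults to tz = "UTC", while time_hour carries America/New_York, so on the real data you would want to pass tz = "America/New_York" to keep the two approaches consistent.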

(1 point) Flights Per Month

Using ggplot, create a plot of the number of flights per month. What, if anything, do you note about the total volume of flights throughout the year? (Don’t worry if the plot doesn’t tell you anything interesting about the data. This data is pretty… boring.)

Answer

Comment:

The plot shows a relatively stable volume of flights throughout the year. Total flight volume is lowest in the winter months (January and February) and then slightly increases and remains high throughout the summer and fall, demonstrating a minor degree of annual seasonality common in the travel industry.

flights_ts %>%
  index_by(month_index = yearmonth(time_index)) %>%
  summarize(monthly_sum = n()) %>%
  ggplot(aes(x = month_index, 
             y = monthly_sum)) + 
  geom_line() + 
  labs(title = "Monthly number of flights departing NYC",
       x = "Month",
       y = "Number of Flights")

(1 point) The Tropics

Is there a difference in flights to tropical destinations throughout the year? Use the following concept of a tropical destination:

A tropical destination is one that is “in the tropics” – that is, located between the Tropic of Cancer and the Tropic of Capricorn.

  1. Using the airports dataset, create a new variable, is_tropical that notes whether the destination airport is in a tropical latitude.
  2. Join this airports data onto the flights data.
  3. Produce a plot that shows the volume of flights to tropical and non-tropical destinations, counted as a monthly total, throughout the year.
     a. First, try to do this using a group_by call that groups on month and is_tropical. Why does this not work? What is happening when grouping by month while also having a time index?
     b. Instead, you will need to look into tsibble::index_by and combine this with a lubridate “extractor” to pull the time object that you want out of the time_index variable that you created.
     c. To produce the plot, group_by(is_tropical), and index_by the month that you extract from your time_index. (This is a bit of a strange part of the geom_* API, but this might be a useful place to use the geom_step geometry to highlight changes in this series.)
  4. Comment on what you see in the flights to the tropics, compared to flights to non-tropical destinations.

Answer

Using group_by(month, is_tropical) with the original flights data doesn’t work because tsibble requires a unique time index column (time_index) for its output. When we group by a lower-granularity variable (month) while the time index is at a higher granularity (minute), there is no single correct time index for the resulting summarized rows.

Comment on the plots: There are destination airports in the flights data that do not appear in airports. Inspecting the setdiff reveals 4 airports, all of which are located in the tropics, so I assigned all of these flights to tropical destinations.

I created two versions of the plot: tropical flight volume is small compared to the rest, so the second (faceted) plot gives a better view of the seasonality.

Overall, the volume of flights to non-tropical destinations is far larger and highly stable throughout the year, with some seasonal variation.

Flights to tropical destinations clearly exhibit annual seasonality, peaking during the colder months (December/January) when people seek warm weather, and also rising slightly during the summer months.

#?airports
#airports

airports_augmented <- airports %>%
  mutate(is_tropical = ifelse(
    lat >= -23.5 & lat <= 23.5,
    "tropical", 
    "non-tropical"
  ))

#unique(airports_augmented$is_tropical)

flights_airports_ts <-
  flights_ts %>%
  left_join(airports_augmented, by = c("dest" = "faa")) %>%
  
  #    Handle the NA values created by missing airport codes (SJU, STT, BQN, PSE).
  #    Since these known missing airports are tropical, we replace NA with "tropical".
  mutate(
    is_tropical = replace_na(is_tropical, "tropical")
  ) %>%
  # Convert the final, cleaned column to a factor
  mutate(is_tropical = factor(is_tropical))
#flights  %>%
#  group_by(month, is_tropical)  %>%
#  summarise(total_flights = n())
flights_airports_ts %>%
  group_by(is_tropical) %>%
  index_by(month_index = yearmonth(time_index)) %>%
  summarize(total_flights = n()) %>%
  ggplot(aes(x = month_index, 
             y = total_flights,
             color = is_tropical)) + 
  geom_step(linewidth = 1) + 
  labs(title = "Monthly Flight Volume to Tropical vs. Non-Tropical Destinations",
       x = "Month",
       y = "Total Flights",
       color = "Destination Type"
  )

flights_airports_ts %>%
  group_by(is_tropical) %>%
  index_by(month_index = yearmonth(time_index)) %>%
  summarise(
    total_flights = n()) %>%
  # Because one line is too low,
  # I use another to gain insight
  ggplot(aes(x = month_index, y = total_flights)) +
  geom_step(linewidth = 1) +
  
  # Add facetting to separate the two series vertically
  facet_grid(is_tropical ~ ., scales = "free_y") +
  
  labs(
    title = "Monthly Flight Volume by Destination Type",
    x = "Month",
    y = "Total Flights"
  )

summary(flights_airports_ts$is_tropical) 
## non-tropical     tropical 
##       328467         8309
# Exploration: which destination codes in flights are missing from airports?
# (Run interactively; yields SJU, STT, BQN, PSE -- all in the tropics.)

# 1. Get all unique destination codes from flights
#flight_dests <- unique(flights_ts$dest)

# 2. Get all unique airport codes from the augmented airports data
#airport_codes <- unique(airports_augmented$faa)

# 3. Find which flight destinations are NOT in the airport codes list
#missing_codes <- setdiff(flight_dests, airport_codes)

# Output the missing codes
#print(missing_codes)

Question 2 - Weather at New York Airports

Our goal in this question is to ask you to re-apply what you know about producing time series objects to very similarly structured data.

(1 point) Create a time series of weather

Turn your attention to the weather data that is provided in the nycflights13::weather dataset. Produce a tsibble that uses time as a time index, and origin as a key for this data. You will notice that there are three origins, “EWR”, “JFK” and “LGA”.

(Hint: We anticipate that you are going to see the following error on the first time that you try to convert this data frame:

Error in `validate_tsibble()`:
A valid tsibble must have distinct rows identified by key and index.
Please use `duplicates()` to check the duplicated rows.
Run `rlang::last_error()` to see where the error occurred.

This is a very helpful error with a helpful message. If you see it, do as the message suggests and look into the duplicates() function to determine what the issue is. Once you have found the issue, (1) document it; (2) propose a solution that seems reasonable; and (3) implement your proposed solution and keep it moving to answer this question.)

Answer

Anticipated Issue:

The nycflights13::weather data is said to contain duplicates, but I did not run into the error message, and

weather %>% duplicates(key = origin, index = time_hour)

returned zero rows. So the data looks okay.
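For completeness: had duplicates() returned rows, one reasonable fix is to keep the first observation per (origin, time_hour) pair before casting to a tsibble. A sketch on toy data (the column names mirror weather, but the values here are made up):

```r
library(dplyr)

# Toy weather-like data with one duplicated (origin, time_hour) pair.
w <- data.frame(
  origin    = c("EWR", "EWR", "JFK"),
  time_hour = as.POSIXct(c("2013-01-01 01:00", "2013-01-01 01:00",
                           "2013-01-01 01:00"), tz = "America/New_York"),
  temp      = c(39.0, 39.1, 38.5)
)

# Keep the first row for each key/index combination; the result then
# satisfies tsibble's requirement of distinct (key, index) rows.
w_dedup <- distinct(w, origin, time_hour, .keep_all = TRUE)
nrow(w_dedup)  # 2
```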

# weather %>%
#   mutate(time_index = make_datetime(year, month, day, hour)) %>%
#   duplicates(key = origin, index = time_index) %>%
#   glimpse()
#weather
#?weather

# Okay if producing 0 row
weather %>%
  duplicates(key = origin, index = time_hour)
## # A tibble: 0 × 15
## # ℹ 15 variables: origin <chr>, year <int>, month <int>, day <int>, hour <int>,
## #   temp <dbl>, dewp <dbl>, humid <dbl>, wind_dir <dbl>, wind_speed <dbl>,
## #   wind_gust <dbl>, precip <dbl>, pressure <dbl>, visib <dbl>,
## #   time_hour <dttm>
weather_ts <- 
  as_tsibble(
    weather,
    key = origin,
    index = time_hour
  )

weather_ts
## # A tsibble: 26,115 x 15 [1h] <America/New_York>
## # Key:       origin [3]
##    origin  year month   day  hour  temp  dewp humid wind_dir wind_speed
##    <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>
##  1 EWR     2013     1     1     1  39.0  26.1  59.4      270      10.4 
##  2 EWR     2013     1     1     2  39.0  27.0  61.6      250       8.06
##  3 EWR     2013     1     1     3  39.0  28.0  64.4      240      11.5 
##  4 EWR     2013     1     1     4  39.9  28.0  62.2      250      12.7 
##  5 EWR     2013     1     1     5  39.0  28.0  64.4      260      12.7 
##  6 EWR     2013     1     1     6  37.9  28.0  67.2      240      11.5 
##  7 EWR     2013     1     1     7  39.0  28.0  64.4      240      15.0 
##  8 EWR     2013     1     1     8  39.9  28.0  62.2      250      10.4 
##  9 EWR     2013     1     1     9  39.9  28.0  62.2      260      15.0 
## 10 EWR     2013     1     1    10  41    28.0  59.6      260      13.8 
## # ℹ 26,105 more rows
## # ℹ 5 more variables: wind_gust <dbl>, precip <dbl>, pressure <dbl>,
## #   visib <dbl>, time_hour <dttm>

(4 points) Plot temperature

With this weather data, produce the following figure of the temperature every hour, for each of the origins.

This figure contains five separate plots:

  • One that shows the entire year’s temperature data;
  • Two that show the month of January and July; and,
  • Two that show the first week of January and July.

You might think of these plots as “zooming in” on the time series to show more detail.

In your workflow, first create each of the plots. Then, use the patchwork package to compose each of these plots into a single figure.

After you produce this figure, comment on what you notice at each of these scales and the figure overall.

Answer

Commentary

Entire Year: The entire series is dominated by strong annual seasonality, showing a large, smooth sinusoidal pattern. All three airport temperatures track closely, indicating small temperature differences across the NYC metropolitan area.

Monthly (January and July): At this scale, the daily cycle becomes apparent. January shows low but fluctuating temperatures. July shows consistently warm temperatures, clearly illustrating the seasonal contrast.

Weekly/Daily: The daily cycle is dominant. Temperature rises during the day and falls at night in a smooth, predictable manner. High-frequency noise and short-term volatility are visible at this finest scale.

yearly_plot <- 
  weather_ts %>%
  ggplot(aes(
    x = time_hour,
    y = temp,
    color = origin
  )) +
  geom_line(linewidth = 0.25) + 
  labs(
    title = "Full Year Weather Data",
    x = "Time",
    y = "Temperature (°F)"
  )

yearly_plot

january_plot <- 
  weather_ts %>%
  filter(month(time_hour) == 1) %>%
  ggplot(aes(
    x = time_hour,
    y = temp,
    color = origin
  )) + 
  geom_line(linewidth = 0.2) + 
  labs(
    title = "2013 January Temperature",
    x = "Time - Hour",
    y = "Temperature (°F)"
  )

january_plot

july_plot <-   
  weather_ts %>%
  filter(month(time_hour) == 7) %>%
  ggplot(aes(
    x = time_hour,
    y = temp,
    color = origin
  )) + 
  geom_line(linewidth = 0.2) + 
  labs(
    title = "2013 July Temperature",
    x = "Time - Hour",
    y = "Temperature (°F)"
  )

july_plot

january_first_week <-
  weather_ts %>%
  filter((month(time_hour) == 1) & (day(time_hour) <= 7)) %>%
  ggplot(aes(
    x = time_hour,
    y = temp,
    color = origin
  )) + 
  geom_line(linewidth = 0.5) + 
  labs(
    title = "2013 January First Week Temperature",
    x = "Time - Hour",
    y = "Temperature (°F)"
  )

january_first_week

july_first_week <-
  weather_ts %>%
  filter((month(time_hour) == 7) & (day(time_hour) <= 7)) %>%
  ggplot(aes(
    x = time_hour,
    y = temp,
    color = origin
  )) + 
  geom_line(linewidth = 0.5) + 
  labs(
    title = "2013 July First Week Temperature",
    x = "Time - Hour",
    y = "Temperature (°F)"
  )

july_first_week

patch_fig <-
yearly_plot /
  (january_plot | july_plot) /
  (january_first_week | july_first_week) +
  plot_annotation(
    title    = 'Temperature at New York City Airports',
    subtitle = 'Many Different Views',
    tag_levels = 'A') & 
  labs(x = NULL, y = 'Temperature') 


patch_fig

(1 point) Hourly ACF

At the hourly level, produce an ACF and a lag plot at JFK. What do you learn from these plots? (Note that you can suppress all the coloring in the gg_lag call if you pass an additional argument, color = 1.)

Answer

Commentary

ACF Plot: The ACF coefficients show a slow decay with a clear wave-like pattern, peaking every 24 lags. This indicates the series is highly persistent (hence non-stationary) and possesses strong daily seasonality.

Lag Plot: The points form a tight, upward-sloping cluster. This visually confirms the high positive correlation – the temperature at a given hour is a reliable predictor of the temperature in the hours that follow.

hourly_acf <- 
  weather_ts %>%
    filter(origin == "JFK") %>%
    fill_gaps() %>%
    ACF(temp, lag_max = 48) %>%
    autoplot() +
    labs(
      title = "JFK Hourly Temperature ACF"
    )
  
hourly_acf

hourly_lag <- 
  weather_ts %>%
  filter(origin == "JFK") %>%
  gg_lag(temp, lags = 1:9, color = 1) + 
  labs(title = "JFK Hourly Temperature Lag Plot")
## Warning: `gg_lag()` was deprecated in feasts 0.4.2.
## ℹ Please use `ggtime::gg_lag()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
hourly_lag

# Combine and display the plots
hourly_acf | hourly_lag
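The wave shape in the ACF can be sanity-checked with base R: a pure 24-hour sinusoid (a simulated series, not the weather data) reproduces the same pattern, dipping negative at lag 12 (half a cycle) and rising back toward 1 at lag 24 (a full cycle).

```r
# A pure 24-hour cycle, sampled hourly for roughly three weeks.
x <- sin(2 * pi * (1:500) / 24)
r <- acf(x, lag.max = 48, plot = FALSE)$acf

r[13]  # lag 12: strongly negative (half a cycle out of phase)
r[25]  # lag 24: strongly positive (back in phase)
```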


(1 point) Weekly ACF

At the weekly level, produce an ACF and a lag plot of the weekly average temperature at JFK. What do you learn from these plots?

Answer

Comment: The aggregation to weekly averages removes the daily cycle seen in the hourly data. The ACF still decays slowly, indicating that the series is non-stationary, but the most important feature is the large periodic peak around lag 52 (one year). This clearly highlights the annual seasonality of the average temperature.

weekly_acf <- 
  weather_ts %>%
#  fill_gaps() %>%
  filter(origin == "JFK") %>%
  index_by(week_index = yearweek(time_hour)) %>%
  summarize(
    avg_temp = mean(temp, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  ACF(avg_temp, lag_max = 52) %>%
  autoplot() + 
  labs(title = "JFK Weekly Average Temperature ACF")

weekly_acf

weekly_lag <- 
  weather_ts %>%
#  fill_gaps() %>%
  filter(origin == "JFK") %>%
  index_by(week_index = yearweek(time_hour)) %>%
  summarize(
    avg_temp = mean(temp, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  gg_lag(avg_temp, lags = 1:9, color = 1) + 
  labs(title = "JFK Weekly Average Temperature Lag Plot")
weekly_lag


(1 point) Monthly ACF

At the monthly level, produce an ACF plot of the monthly average temperature at JFK. What do you learn from these plots?

Answer

Comment: The monthly ACF shows slow decay and annual seasonality (even with only one year’s worth of data). This again points to non-stationarity.

monthly_acf <- 
  weather_ts %>%
#  fill_gaps() %>%
  filter(origin == "JFK") %>%
  index_by(month_index = yearmonth(time_hour)) %>%
  summarize(
    avg_temp = mean(temp, na.rm = TRUE),
    .groups = "drop") %>%
  ACF(avg_temp) %>%
  autoplot() + 
  labs(
    title = "JFK 2013 Monthly Avg. Temp. ACF"
  )
  
monthly_acf

monthly_lag <- 
  weather_ts %>%
  filter(origin == "JFK") %>%
  index_by(month_index = yearmonth(time_hour)) %>%
  summarize(
    avg_temp = mean(temp, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  gg_lag(avg_temp, lags = 1:9, color = 1) + 
  labs(title = "JFK 2013 Monthly Avg. Temp. Lag Plot")
monthly_lag


Question 3 - BLS Data

This is the last exercise for this assignment. Here, we’re going to do the same work that you have done twice before, but against “live” data that comes from the United States’ Bureau of Labor Statistics.

Recall that in the lecture, Jeffrey identifies the unemployment rate as an example of a time series. You can get to this data from the public web-site. To do so, head here:

What do you see when you get to the next page? A rectangular data series that has months on the columns, years on the rows, and values in the cells? :facepalm:

This motivates the idea of using the BLS’ data API. The data API provides consistently formatted JSON objects that can be converted into whatever format is useful to us. Because the data is provided as JSON, there is some work to coerce it into a useful form, but so many people do this same coercion that there are ready-made wrappers that will help us do this work.

As an example, you can view how these JSON objects are formatted by navigating to an API endpoint in your browser. Here is the endpoint for the national unemployment: [link].
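To make the shape of those responses concrete, here is a sketch that parses a hand-written snippet mimicking the nesting of a BLS v2 payload with jsonlite (the values below are made up for illustration; blsR does this coercion for us on real responses):

```r
library(jsonlite)

# Hand-written JSON mimicking the Results$series[]$data[] nesting of a
# BLS API response. Note that year/period/value arrive as strings.
js <- '{
  "Results": { "series": [ {
    "seriesID": "LNS14000000",
    "data": [
      {"year": "2013", "period": "M02", "value": "7.7"},
      {"year": "2013", "period": "M01", "value": "8.0"}
    ]
  } ] }
}'

parsed <- fromJSON(js)
obs <- parsed$Results$series$data[[1]]  # one data frame per series
obs$value <- as.numeric(obs$value)      # coerce string values to numeric
obs
```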

Let’s pull unemployment from the BLS data API.

  1. Register for an API key with the BLS. You can register for this from the BLS’ “Getting Started” page. They will then send an API key to the email address that you provide.
  2. Find the series that we want to access. Frankly, this is the part of accessing this API that is most surprisingly difficult – the BLS does not publish a list of its data series. From their Data Retrieval Tools page there are links to popular series, a table lookup, and a Data Finder. Elsewhere they provide pages that describe how series IDs are formatted, but finding series still requires considerable meta-knowledge.

For this assignment, consider the following three series:

  1. Total unemployment: LNS14000000
  2. Male unemployment: LNS14000001
  3. Female unemployment: LNS14000002

Our goal is to analyze these three series for the last 20 years.

To work with the BLS API, we have found the blsR library to be the most effective (at the time we wrote the assignment, in 2022). Here are links to help you read into the package. Rather than providing a full walk-through of how to use this package to manipulate the BLS data API, a learning goal is for you to read these documents and come to an understanding of how the package works.

(2 points) Form a successful query and tidy of data

Your task is to create an object called unemployment that is a tsibble class, that contains the overall unemployment rate, as well as the unemployment rate for male and female people.

Your target dataframe should have the following shape but extend to the current time period.

year month time_index name    value
   <int> <int>      <mth> <chr>   <dbl>
 1  2000     1   2000 Jan overall   4  
 2  2000     1   2000 Jan male      3.9
 3  2000     1   2000 Jan female    4.1
 4  2000     2   2000 Feb overall   4.1
 5  2000     2   2000 Feb male      4.1
 6  2000     2   2000 Feb female    4.1
 7  2000     3   2000 Mar overall   4  
 8  2000     3   2000 Mar male      3.8
 9  2000     3   2000 Mar female    4.3
10  2000     4   2000 Apr overall   3.8

Answer

Query data

I defined current_year programmatically and saved the query of the three series in one data frame.

Tidy data

Manually downloaded data has year and month as rows and the three series as columns, i.e., a wide format. Tidy data requires a long format, where each observation (a monthly unemployment reading) is a single row and each variable (year, month, series_id, value) is a separate column.

So I pivot the three series (column) names into one column with three values/factors.
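A minimal sketch of that wide-to-long pivot on toy data (the column names mirror the series IDs; the values here are made up):

```r
library(tidyr)

# Wide format: one column per BLS series.
wide <- data.frame(
  year  = c(2000, 2000),
  month = c(1, 2),
  LNS14000000 = c(4.0, 4.1),   # overall
  LNS14000001 = c(3.9, 4.1),   # male
  LNS14000002 = c(4.1, 4.1)    # female
)

# Long format: one row per (month, series) observation.
long <- pivot_longer(wide, cols = starts_with("LNS"),
                     names_to = "series_id", values_to = "value")
nrow(long)  # 6
```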

Using API as opposed to manually downloaded data

Doing analysis on manually downloaded data requires a manual update (download, cleaning, and integration) every time the BLS releases new data. This process is non-reproducible, not scalable, and error-prone, making the analytic pipeline difficult to maintain. Using the blsR API package makes the process automated and programmatic.

Disclosure

Code is produced with help from AI but I checked every line of code and made many edits.

# The key below is redacted; use the API key the BLS emails you on registration
Sys.setenv(BLS_KEY = "<your-BLS-API-key>")

series_id_list <- c(
  overall = "LNS14000000",
  male    = "LNS14000001",
  female  = "LNS14000002"
)

current_year <- as.numeric(format(Sys.Date(), "%Y"))
current_year
## [1] 2025
# Query the BLS API for the last 20 years (here, 2005 to the current year)
# Note: get_n_series_table returns a tidy table with one column per series
raw_bls_data <- blsR::get_n_series_table(
  series_id_list,
  # Request 20 years of data
  start_year = current_year - 20,
  end_year = current_year
)
## Year 2005 to 2025 is longer than 10 year API limit. Performing 3 requests.
## Warning: api_key is required for multiple series requests.
## api_key is required for multiple series requests.
## api_key is required for multiple series requests.
dim(raw_bls_data)
## [1] 248   5
class(raw_bls_data)
## [1] "tbl_df"     "tbl"        "data.frame"
unemployment <- raw_bls_data %>%
  # 1. Pivot the data from wide to long format
  pivot_longer(
    cols = starts_with("LNS"), # Select all columns starting with the Series ID prefix
    names_to = "series_id",
    values_to = "value"
  ) %>%
  
  # 2. Convert BLS period (e.g., "M01" -> "1") into a proper date object
  mutate(
    # Create the date by combining year and the period (month number)
    date = lubridate::ymd(paste0(year, "-", gsub("M", "", period), "-01")),
    
    # Clean up the series IDs into readable names
    name = case_when(
      series_id == "LNS14000000" ~ "overall",
      series_id == "LNS14000001" ~ "male",
      series_id == "LNS14000002" ~ "female",
      TRUE ~ series_id
    )
  ) %>%
  
  # 3. Finalize columns for tsibble casting
  mutate(
    year = lubridate::year(date),
    month = lubridate::month(date),
    time_index = yearmonth(date)
  ) %>%
  
  # 4. Cast to tsibble
  tsibble::as_tsibble(
    key = name,
    index = time_index
  ) %>%
  
  # 5. Select the required final columns and ensure value is numeric
  select(year, month, time_index, name, value) %>%
  mutate(value = as.numeric(value))
head(unemployment)
## # A tsibble: 6 x 5 [1M]
## # Key:       name [1]
##    year month time_index name   value
##   <dbl> <dbl>      <mth> <chr>  <dbl>
## 1  2005     1   2005 Jan female   5.1
## 2  2005     2   2005 Feb female   5.3
## 3  2005     3   2005 Mar female   5.1
## 4  2005     4   2005 Apr female   5.2
## 5  2005     5   2005 May female   5.2
## 6  2005     6   2005 Jun female   5.1

(1 point) Plot the Unemployment Rate

Once you have queried the data and have it successfully stored in an appropriate object, produce a plot that shows the unemployment rate on the y-axis, time on the x-axis, and each of the groups (overall, male, and female) as a different colored line.

Answer

unemployment %>%
  ggplot(aes(x = time_index, y = value, color = name)) +
  geom_line(linewidth = 1) + 
  labs(title = "Unemployment by gender: 2005 - 2025",
       x = "Time",
       y = "Unemployment Rate (%)",
       color = "Series")

(1 point) Plot the ACF and Lags

This should feel familiar by now: Produce the ACF and lag plot of the overall unemployment series. What do you observe?

Answer

Comment:

ACF Plot: The ACF coefficients start near 1 and decay slowly. This indicates non-stationarity and persistence: unemployment levels change gradually. There is no significant peak at lag 12 (or multiples of 12), likely because the series is seasonally adjusted.

Lag Plot: The points form a mostly tight, upward-sloping cluster. This visually confirms a high positive correlation between this month’s unemployment rate and the next month’s rate, showing that the series is highly predictable from its immediate past. The few long excursions mark rare events where unemployment jumped due to exogenous shocks (COVID-19, the subprime financial crisis).

unemployment %>%
  filter(name == "overall") %>%
  ACF(value, lag_max = 240) %>%
  autoplot() + 
  labs(title = "Overall Unemployment Rate ACF")

unemployment %>%
  filter(name == "overall") %>%
  gg_lag(value, lags = 1:9, color = 1) + 
  labs(title = "Overall Unemployment Rate Lag Plot")
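Since the slow ACF decay points to non-stationarity, here is a quick base-R illustration of why first-differencing is the usual next step. This is a sketch on a simulated random walk, not on the BLS data itself:

```r
set.seed(1)
level  <- cumsum(rnorm(200))  # random walk: highly persistent, like the levels
diffed <- diff(level)         # first difference: approximately white noise

acf(level,  lag.max = 20, plot = FALSE)$acf[2]  # lag-1 autocorrelation near 1
acf(diffed, lag.max = 20, plot = FALSE)$acf[2]  # lag-1 autocorrelation near 0
```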